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Abstract 

One of the most widely used methods for solving average cost MDP problems is the 
value iteration method. This method, however, is often computationally impractical and 
restricted in size of solvable MDP problems. We propose acceleration operators that improve 
the performance of the value iteration for average reward MDP models. These operators 
are based on two important properties of Markovian operator: contraction mapping and 
monotonicity. It is well known that the classical relative value iteration methods for average 
cost criteria MDP do not involve the max-norm contraction or monotonicity property. To 
overcome this difficulty we propose to combine acceleration operators with variants of value 
iteration for stochastic shortest path problems associated average reward problems. 
Keywords: Markov decision processes, stochastic shortest path problem, value iteration, 
accelerated convergence, linear programming. 



1 Introduction 

One of the most widely used methods for solving average MDP problems is the value iteration 
method. In general, this method is often appears to be computationally impractical and 
restricted in size of solvable MDP problems. Shlakhter, et al. [8] proposed acceleration 
operators to speed up the convergence of value iteration algorithms for discounted MDP 
models. These operators were based on the contraction property of Markovian operators 
for discounted MDPs. In this paper we will show how similar techniques can be used to 
accelerate the convergence of the value iteration for average reward MDP models. 

It is well known that the classical relative value iteration methods for average cost cri- 
teria MDP, unlike the discounted and the expected total-cost (stochastic shortest path) 
models do not involve the max-norm contraction or monotonicity property. Bertsekas jU [2] 
proposed an elegant way of constructing variants of the relative value iteration algorithm 
using the connection between the average cost models and corresponding stochastic short- 
est path problems, which possesses the above properties under assumption that all policies 
are unichain and there exists a recurrent state under all policies. We will show how this 
approach, combined with acceleration operators technique, can be used to get more efficient 
variants of value iteration for average reward MDP models. 

The rest of the paper is organized as follows. In Sections 12.11 and [2 . 21 we briefly present the 
accelerated operators method for discounted MDPs and Bertecas' variants of value iteration 
algorithm for the average reward MDP. In Section 12.31 we introduce variants of acceler- 
ated value iteration algorithms for average reward MDPs. Section 3 presents the numerical 
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studies of the proposed accelerated value iteration algorithms. In Section 4 we discuss the 
computational complexity of accelerated operators. Finally, Section 5 concludes the paper. 



2 Accelerated Value Iterations Algorithms 

2.1 Accelerating Operators (Infinite Horizon Discrete Time Dis- 
counted MDP Case) 

Consider an infinite horizon Markov decision process (MDP) with a finite set of states de- 
noted by S, a finite set of actions A(i) for each state i G S, an immediate reward r(i, a) for 
each i G S and a G A = Ui & sA(i), and transition probabilities Pij(a) for the i,j G S and 
a G A(i). The objective is to determine Vi, the minimum expected total discounted reward 
over an infinite horizon starting in state i, where a is the discount factor (0 < a < 1). It is 
well known |6j that v satisfies the optimality equation 



The optimality equation given in Equation (j4j) can be written, with the definition of the 
operator T on U, in the following vector notation. 



where II is the set of policies. There are several standard methods for finding optimal or ap- 
proximately optimal policies for the discounted MDP models. Approaches widely employed 
to solve MDP problems include value iteration, policy iteration, and linear programming ap- 




(1) 



v = Tv = min {r^ + aPdV } 



(2) 
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proach [6]. Shlakhter et al. [8] proposed accelerating operators to improve the convergence 
of the value iteration algorithm. Let us briefly discuss this technique. Consider the linear 
programming formulations for discounted MDPs 

max | h(i) - a^2Pij(a)hQ) < r (h a), Vi G T~n, Va e A(i), h G R n . J (3) 

Let V be the feasible set of linear program. It can be determined by thesystem of 
inequalities 

V = < /i | - a^J^y(a)/i(j) < r(i, a),Vi G l,n,Va G G 
I i=i 

Let T be the Markovian operator 

Th(i) = min < r(i, a) + a 7 pij(a)h(j) 
aeA(i) ^ 

The most crucial observation, that leads to characterization of the acceleration operators, 
is that the set V is invariant under T, as formally stated in Lemma 12.11 

Lemma 2.1. V is invariant under T. That is, TV C V. 

Lemma 12.11 suggests the following conditions for acceleration operator Z on defined set 

V. 

Acceleration Conditions 

(A) ZV C V, 

(B) Zv >v,Vve V. 
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Remark 2.1. It is easy to show that if v* is the fixed point of operator T, v n+1 = Tv n , 

w n+1 = ZTw n , and to = v° then 

(i) v* >w n > v n , 

(ii) v* = lim^oo w n = ]im n -+ 0O v n , 

(iii) \\v* — w n \\ < \\v* — v n \\. 

(iv) The sequence {w n } converges globally with order 1 at a rate less than or equal to a; 
its global asymptotic average rate of convergence is less than or equal to a. 

In [8] we presented two particular realization of acceleration operator Z: Projective 
Operator and Linear Extension Operator. 

For a given operator satisfying two conditions (A) and (B), several variants of accelerated 
value iteration algorithm were be suggested. More detailed discussion regarding application 
of this method to discounted MDP models can be found in the original paper. Further we 
will present the extended explanation of this technique applied to average reward MDPs. 

2.2 Bertsecas Approach 

The most important properties of the Markovian operator T of discounted or total cost (stochastic 
shortest path) used in the previous section are contraction property with respect to max-norm 
and monotonicity property. It is well known, however, that the classical value iteration meth- 
ods for average cost criteria MDP, unlike discounted and total cost models, do not involve 
the max-norm contraction or monotonicity property. This means that direct application 
of acceleration operators technique to average MDP models is impossible. Bertsekas [TJ, [2] 
proposed a way of constructing the variants of the relative value iteration algorithm, using 
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the connection between the average cost models and the corresponding stochastic shortest 
path problem. Under assumption that all policies are unichain and there exists a recurrent 
state under all policies, the Markovian operator is a contraction mapping with respectto 
a weighted max-norm, and possesses the monotonicity property. We will show how this 
approach combined with acceleration operators technique can be used to get more efficient 
variants of value iteration for average reward MDP models. 

Let us briefly describe the approach proposed by Bertsakas [IJ [2] . Consider an infinite 
horizon Markov decision process (MDP) with a finite set of states denoted by S, a finite 
set of actions A(i) for each state i G S, an immediate reward r(i, a) for each i G S and 
a E A = Ui<zsA(i), and a transition probability Pij(a) for each i,jES and a G A(i). The 
objective is to determine A*, the minimum average cost per state over an infinite horizon 
starting in state i, which satisfies the optimality equation 



where h* is a differential vector. 

Let us assume that there is a state, denoted by n, which is a recurrent state under any 
stationary policy. Consider the stochastic shortest path problem (denoted A-SSP) obtained 
from original model by adding a new state t such that p' it = p in for each i G S, p\, = pij 
for each i G S, j G S \ {n}, and p' in = for each i G S, where p'^ is the transition 
probability of the new model. The costs of the new model will be equal to rj(a) — A for 
each i G S, where A is a scalar parameter, and r t (-) = for terminating state t. Let h ni \{i) 
be the total expected cost of stationary policy tt starting from state i, and let the function 
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h\(i) = max T ^A(i), i = n. One can show that the functions h\(i) are concave, 
monotonically decreasing, and piecewise linear as functions of A, and that 

h x (n) = if and only if A = A* (5) 

Furthermore, the vector h\*, together with A*, satisfies the optimality equation 

Ae + hx* = Th x *. (6) 

As it was shown in [TJ [2] A* can be found using one- dimensional search procedure. It 
requires solution of several associated stochastic shortest path problems, for which value of 
parameter A is updated as 

\ k+l = \ k + 1 k h xk (n) (7) 

where h X k{n) is the optimal solution of A fc -SSP which can be found using the value iteration 
in form 

h m+1 (i) = min { r(i,a) - \ k + Pij (a)h m (j)\ for alH G IT 

a6A(i) [ ^ J 



n. 



with X k fixed throughout the value iteration procedure. 

Remark 2.2. As it was shown in [fl [2], the sequence of iterates (X k ,hxk(n)) converges to 
(A*, h\*(n)), provided that stepsize •y k < max X N ^ , where N n (i) is the expected value of 
the first positive time that n is reached under 7r starting from state i. It is also easy to see 
that, if additionally h°(n) < 0, then the sequence A fc converges monotonously to A*, which 
is X k | A*. 

The more efficient form of the algorithm proposed in [Tj, [2] is to update parameter A fc 



for each iteration 



n-l 



h k+1 (i) = min { r(i, a) - X k + V Pij (a)h n (j) } for all i G 1, 
aeA(i) I ^ I 



(9) 



where the parameter \ k+1 = \ k + r y k h k+l (n), 7™ is a positive, sufficiently small step size, and 
h k+1 is a current approximation of optimal solution of h X k+i of corresponding A fe+1 -SSP. It 
was also proposed the improved variant of above algorithm, which is based on the following 
inequality 



where 



ir < A < p 



f3 k = \ k + min 



mm[h k+1 {i) -h k (i)},h k+1 (n) 



(10) 



:n) 



j3 k = X k + max 



max[/i fe+1 (i) -h k (i)},h k+1 (n) 



(12) 



Using this inequality it is possible to replace the iteration A fc+1 = A fc + ^y k h k+1 (n) by 
\ k+1 = n fc [A fc + ^ k h k+1 {n)\ , where n fc [c] denotes the projection of a scalar c on the interval 



max p m , min f3 

m=0,...,fc — m=0,...,k 



(13) 



Remark 2.3. Since the absorbing state t has cost, the cost of any policy starting from t is 0. 
Because of this we can ignore the the component corresponding to the state t and exclude it 
from summation. Since for all states i the transition probabilities Pi n (a) = the component 
corresponding to the state n can be also ignored. 
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Remark 2.4. Though the state space of the original stochastic shortest path problems has 
the dimension n + 1, taking into consideration the statement of first part of remark 12.31 and 
because the Markovian operator (JE]) and ([9]) does not change the value in state t, we will 
consider this problem as problem in n-dimensional state space. 

The proof of the convergence of this algorithm can be found in original paper (see [2]). 
Using this lemma it is easy to show that 

\\h k -h x *\\ < \\h k -h xk \\ + \\h X k-h x *\\ < \\h k -h xk \\ +0(|A fc - A*|) (14) 

It was shown in pD, [2] that both of sequences h k and \ k converge at rate of a geometric 
progression. 

One of the disadvantages of the proposed algorithms is that the rate of convergence 
of these methods is relatively slow. To bypass this we propose the acceleration operators 
technique introduced in [S] to accelerate convergence of discounted MDP models. As we 
will show, this approach, which speeds up the convergence of value iteration algorithms, can 
be very efficient for solving average cost MDPs. 

2.3 Accelerating Operators Approach (Average Reward MDP Case) 

In this section we show how the acceleration operators technique can be applied to average 
reward MDP models 
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2.3.1 Reduction to Discounted Case 

The acceleration operators can be directly applied to the following special case of the average 
MDP [lj. Assume that there is a state t such that for some (3 > we have pu{a) > f3 for 
each i G S. One can see that (1 — (3) -discounted problem with the same state space, and 
actions, and transition probabilities 

f (1 - (3)- l Pij (a) ifj/t, 

ViM) = ( 15 ) 

L (l-/3)-i( Pii (a)-/3) if j = t. 
Then j3v(t) and v(i) are optimal average and differential costs, where v is the optimal cost 
function of the corresponding (1 — /5)-discounted problem. 

Another straightforward application of the acceleration operators is to use the following 
relationship 

A* = limfl - a)v a (i) (16) 

a—*l 

for each state i G S, where v a is an optimal cost vector for the corresponding a-discounted 
problem. 

Let's briefly discuss this relation. Getting a good approximation of the average cost A* 
using formula (|T5|) requires calculation of the expected total discounted cost for discount 
factor close to 1. In general, for an MDP with the discount factor close to 1 this problem 
is considered to be computationally very demanding. Accelerating operators show a highly 
efficient way of solving such MDPs, making this relation useful for finding the solutions of 
the average cost MDP problems. This relation can also be used to obtain the upper and 
lower bounds for optimal average cost. It is well known [6] that total discounted reward can 
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be expressed in terms of optimal average cost and optimal bias h as 

v a = (l-ai)- 1 \ + h + f(a) (17) 

where h = H p r, H p is a fundamental matrix H p = (I — P — P*) _1 (J — P*), and P* = 
limjv^oo jfEs{52tLi r (X t )}, and f(a) is a vector which converges to zero as a f 1. The 
additional property of the relationship (|T7|) can be formulated as following lemma 

Lemma 2.2. For any MDP problem there exists e (0 < e < 1) such that for any a G (e, 1) 
there exist two coordinates i and j that v a (i) < A* < v a (j) 

This means that min i f Q ,(i) and maxjt> a (i) provide the lower and upper bound for A* for 
all a sufficiently close to 1. Though the above lemma doesn't provide guidance on how to 
choose the discount factor a, the numerical studies show that taking the discount factor to 
be equal to 0.8 or higher gives us sufficiently good results (see Table [2] of Section [3]). 

2.3.2 Variants of Accelerated Value Iterations for Average Reward MDPs 

In this section we propose several variants of accelerated value iteration algorithms. Con- 
sider the following linear programming formulations, which are equivalent to associated 
A fc -stochastic shortest path problems 

{n-1 
^2h(i) | h(i) -J2Pij( a ) h U) < r(z,a) - A fc ,Vi G S,Va G A(i), h G M n . 
i j=i 

Let the set H X k be determined by the system of inequalities shown below: 

{n-1 
h I h(i) - J2Pij( a ) h U) < r (ha) - A fe ,Vi G S,\/a G A(i), h G R n 

10 



and T X k be Markovian operator 

{n.-l 
r{i, a) — X k + \ Pij(a)h(j) 

An observation, similar to Lemma 12.11 of the previous section characterizing the accel- 
eration operators, is that the set H X k is invariant under T X k as formally stated in Lemma 

E3I 

Lemma 2.3. H X k is invariant under T X k. That is, T X kH X k C H X k. 

The following statement holds: 

Lemma 2.4. If X k+1 < X k , then H X k C H X k+i. If X k+1 < X k , then H X k C int(H xk +i). 

It is easy to notice that if h k+1 (n) < 0, then A fe+1 = X k + ^ k h k+l (n) < X k . These two 
lemmas suggest the following conditions for acceleration operator on defined set H X k. 

Acceleration Conditions 

(A) Z k H X k C H X k, 

(B) Z k h > h,Whe H X k. 

The properties of operator Z k satisfying the conditions (A) and (B) are given in the 
following lemma 

Lemma 2.5. If h° G H x o, X k | A*, h k+1 = T X kh k , h k = Z k h k , then 

(i) h* = lim A ._ +00 /i fe = \im k ^ oc h k , 

(ii) h* > h X k >h k > h k , 
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(iii) \\h* — h k \\ > \\h* — h k \\, where ||-|| is a weighted m&x-norm. 

Later on we will present two particular realizations of acceleration operator Z. 

For a given operator satisfying two conditions (A) and (B), several variants of accelerated 
value iteration algorithms can be suggested. Notice that different acceleration operators 
may be used in different iterations of the value iteration algorithm, in which case Z\~ is an 
acceleration operator used in iteration k, instead of Z. 

Consider a variant of VI where A is kept fixed in all iterations until we obtain h X k , the 
optimal solution of A fc -SSP, and after that we calculate X k+1 = X h + r y k h X k. The acceleration 
operators Z^ such that h k+1 = Z^T X kh k make this algorithm computationally attractive. It 
is motivated by the fact that for this variant of VI we can guarantee the monotonicity of 
the sequence h X k (n) for stepsize satisfying the inequality 7 fe < max ^ ^ of Remark I2.2[ and 
as corollary the monotonicity of the sequence X k . This method can be improved using the 
following way of A update. If X k and X k+1 are two approximations of the value A* such 
that A* < A fc+1 < X k and hxk(n) > hxk+i(n) > 0, then it is guaranteed that A > A*, where 
A = A — t 2 ^ — ? n , — is a point of intersection of a straight line through points with 
coordinates (X k , hxk(n)) and (A fc+1 , h X k+i(n)) with A axis on (A, h\) graph. Then we can take 
yk+2 _ m in{A fc+1 + ■yh X k+i(n), A}. This A update is graphically illustrated in Figure ??. 



Figure ?? goes here. 
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A formal description of this variant of VI is given below: 

General Accelerated Value Iteration Algorithm 1 (GAVI 1) 

Step Select A such that h\o(n) < 0, h° = h° e H x o , set k = 0, and 
specify e > . 

Step 1 Compute h X k(i) = \im m ^ OQ ZkT X kh m {i) for all % G /. 

Step 2 If ||/i^fc(n)|| > e, go to Step 3. Otherwise compute X k+1 = min{A fc + 
7 fc /i A fe(ri), A| , where A = A fc — h ^ n ^ x — increase A; by 1 and return 

' h xk+1 (n)-h xk (n) •> 

to Step 1 . 

Step 3 Return with the actions attaining the minimum in Step 1. 

Remark 2.5. The choice of A satisfying the conditions of Step 1 is always available, because 
Amin < A* < A max , where A min = min ieJ)ae A(i) r(i, a) and A max = max, e ; iaeA(l) r(i, a). It is 
easy to see that /?-A max (0 < and h\ min {i) > for all i & I. 

Remark 2.6. The performance can be significantly improved if A is taken as a upper bound 
for A* from the statement derived at the end of the section 12.3.11 

An alternative variant of accelerated VI, though not very efficient, can be obtained from 
(Q, where the parameter A fc is updated at every iteration. The direct application of accel- 
eration technique to a variant with parameter A fc+1 = A fc + 7 fe /i fc+1 (n) is impossible, because 
the monotonicity of the sequence A fc can not be guaranteed. We propose to apply the accel- 
eration step only for iterates for which h h (n) remains negative. Although this method is not 
guaranteed to be applicable for all iterates, it can still be more efficient than the standard 
value iteration. A formal description of this variant of VI is given below: 
General Accelerated Value Iteration Algorithm 2 (GAVI 2) 
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Step Select A such that h x o(n) < 0, h° = h° e H x o , set k = 0, and 
specify e > . 

Step la Compute h k+1 (i) = Z k T X kh k (i) for all i E I until remains 
negative 

Step lb Compute h k+1 (i) = T X kh k (i) for all i el otherwise 

Step 2 If \\h k+1 — h k \\ > e, go to Step 3. Otherwise increase k by 1 
and return to Step 1 . 

Step 3 Return with the actions attaining the minimum in Step 1. 

Now we will introduce one more variant of the accelerated value iteration algorithm. As 
we mentioned earlier, the functions h x (i) are concave, monotonically decreasing, piecewise 
linear function of A, and h x (n) = if and only if A = A*. This means that A* can be 
found using the one-dimensional search procedure. This procedure may be computationally 
intractable, because it requires the solution of several associated expected total cost prob- 
lems. Using the acceleration technique improves this approach significantly. A variant of 
such search procedure can be obtained from GAVI 1, where the parameter X h is updated as 
\ k = (A fc min + A fc max )/2, where lower and upper bounds for A* are obtained at k-th iteration. 
Here some explanations are required. We can take A° m i n = A m i n and A° max = A max defined in 
Remark [2.51 or derived at the end of the section [2.3.11 For A = (A° min + A° max )/2 find h x o. 
If h x o(n) < 0, then AVin = A° min and AVax = A 1 , otherwise AVin = A 1 and A 1 ^ = A° max 
etc. A formal description of this variant of VI is given below: 
General Accelerated Value Iteration Algorithm 3 (GAVI 3) 

Step Select A° min = A min and A° max = A max , h° = h° e H x o m ^, set k = 
0, and specify e>0. 
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Step 1 Select \ k+1 = (A\ lin + A fe max )/2 , . 

Step 2 Compute h X k(i) = lim m _ >00 Z k T X kh m (i) for all % E I. 

Step 3 If ||/i A fc(n)|| < e, go to Step 4. Otherwise if h X k(n) < 0, then 
A fc+1 m in = A fe min and A fc+1 max = A fc , otherwise A fe+1 min = A fc and A fc+1 max = 
A fe max . increase A: by 1 and return to Step 1 . 

Step 4 Return with the actions attaining the minimum in Step 1. 

Now we will propose an acceleration operator satisfying two conditions (A) and (B) 
with significant reduction in the number of iterations before convergence and little additional 
computation in each iteration, so that the overall performance is greatly improved. Now we 
propose an acceleration operator that requires little additional computation per iteration 
but reduces the number of iterations significantly. 

Projective Operator 

For h k E V X k, Z k : h k — > h k + a*e, where a* is the optimal solution of the following trivial 
1-dimensional optimization problem: 

max{V"/i(i) +na | T X k+i (h k + ae) > h k + ae\ . (19) 



Figure ?? goes here. 
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Having a single decision variable in the above optimization problem, it is straightforward 
to find the optimal solution. The role of Projective Operator is graphically illustrated in 
Figure ??, where Z k projects the given point hk € Hk to the boundary of Hk+i- 

Theorem 2.1. For any index k Projective Operator Zk satisfies the conditions (A) and 
(B). 

We will call GAVI 1, GAVI 2, and GAVI 3 with Projective Operator as Projective Ac- 
celerated Value Iteration 1, 2, and 3, or PAVI 1, PAVI 2, and PAVI 3 for short in the paper. 

Now we present another acceleration operator that satisfies Acceleration Conditions 
(A) and (B). 

Linear Extension Operator 

For h h e H, Z k : v — > h k + a*(h k — h k ~ r ), where h k = T^/i* -1 and a* is the optimal solution 
to the following linear program: 

max j^/if + aJ2( h i ~ I T{h k + a{h k - h^ 1 )) > h k + a{h k - /i fc_1 )}. (20) 

Figure [3] graphically illustrates how Linear Extension Operator works. It casts T X kh k 
in the direction of Th k — h k to the boundary of the set H X k+i. Since h k 6 H X k, we have 
T X kh k > h k , which is an improving direction. As a result, Linear Extension Operator moves 
Th k closer to the point h*. 
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Figure [3] goes here. 



Theorem 2.2. Linear Extension Operator G satisfies the conditions (A) and (B). 

When Linear Extension Operator G is used in place of Z in Step 1 of GAVI, we call the 
algorithm Linear Extension Accelerated Value Iteration or LAVI for short in the paper. 

Interesting aspect of the proposed approach is that it can be used in Gauss-Seidel variant 
of value iteration algorithms. 

Gauss-Seidel: h k+1 = T GSxk h k where 

h k+1 (i) = min jr(*,a) - X k + T PlJ (a)h k+1 (j) + V Pii (a)/i fc (i) 1 , V* G T~n. (21) 
We start with the following definition of set: 

H GSxk = {h G R n | h < T GSxk h}- 

The following lemma is an analogue of Lemma 12.31 

Lemma 2.6. Hcs xk i> s invariant under T G s xk ■ 

With Lemma 12.61 acceleration operators satisfying conditions (A) and (B) can be used 
in Step 1 of GAVI with the variants T G s xk of the standard operator T X k. However, it is 
not trivial to define H G s xk with a set of linear inequalities and the acceleration operators 
proposed in this research will not work. To avoid the problem, we restrict the acceleration 
operators to a strict subset of H G s xk ■ 
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Lemma 2.7. The following relation holds 

H x *cH GSxk . (22) 
Remark 2.7. Gauss-Seidel methods require special consideration, since in general H X k ^ 

H GSxk m- 

Lemma 2.8. 

T cs xk (H GSxk ) C H X k . (23) 
Theorem 2.3. The set H X k is invariant under T G s xk - That is, 

Tcs xk H X k C H X k. 

Theorem 12.31 states that H xk is invariant under Tqs k , which suggests that this operator 
can replace T X k in GAVI to give rise to new accelerated value iteration algorithms. Therefore, 
we obtain several accelerated versions of the value iteration algorithm, which are conveniently 
written in the form XAYN, where 'X' is either "P" for "Projective" or "L" for "Linear 
Extension", 'A' is for "Accelerated", £ Y' is either "VI" for VI, or "GS" for GS, and 'N' is 
for one of "1", "2", and "3" for "GAVI1", "GAVI2" , and "GAVI3" . For example, LAGS1 
denotes Linear Extension Accelerated Gauss-Seidel value iteration method of type 1 (as for 
GAVI1) with w n+1 = ZTgs for Step 1. Non-accelerated versions will be shortened to VI, 
and GS without prefixes. 
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3 Numerical Studies 

In this section we present numerical studies to demonstrate the computational improvement 
that the proposed variants of two-phase accelerated value iteration algorithms achieve. The 
results are compared with Bertsecas' approach. We will consider four families of randomly 
generated MDP problems. In all cases the number of actions in each state, the immediate 
rewards for each state, and actions were generated using a uniform random number generator. 
In all examples, we first fixed the number of non-zero entries in each row, so that the density 
of non-zero entries in that row is equal to a given density level. We randomly generated 
non-zero entries according to a uniform distribution over (0,1), normalized these non-zero 
entries so that they add up to 1, and then placed them randomly across the row. 

Example 1. Consider MDPs with 50 states and up to 50 actions per state. The transition 
probability matrices were generated using a uniform random number generator and non- 
zero elements were uniformly placed in the matrix. The density of non-zero elements of 
the matrices varies from 30% to 90%. For two-phase variants of VI (columns of 3-6) the 
corresponding discounted MDP with discount factor a = 0.99 is solved. 

Example 2. Consider MDPs with 100 states and up to 20 actions per state. The transition 
probability matrices were generated using a uniform random number generator and non- 
zero elements were uniformly placed in the matrix. The density of non-zero elements of 
the matrices varies from 60% to 90%. For two-phase variants of VI (columns of 3-6) the 
corresponding discounted MDP with discount factor a = 0.99 is solved. 
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Example 3. Consider MDPs with 80 states and up to ^0 actions per state. The transition 
probability matrices were generated using a uniform random number generator and non- 
zero elements were uniformly placed in the matrix. The density of non-zero elements of 
the matrices varies from 40% to 90%. For two-phase variants of VI (columns of 3-6) the 
corresponding discounted MDP with discount factor a = 0.99 is solved. 

Example 4. Consider MDPs with 200 states and up to 30 actions per state. The transition 
probability matrices were generated using a uniform random number generator and non- 
zero elements were uniformly placed in the matrix. The density of non-zero elements of 
the matrices varies from 50% to 90%. For two-phase variants of VI (columns of 3-6) the 
corresponding discounted MDP with discount factor a = 0.99 is solved. 

As it was shown in [8] that the combinations of Projective operators with standard value 
iteration and Linear Extension operators Gauss-Seidel variant of value iteration give the 
best performance. The computational results of Examples 1-4 are presented in Table [1] and 
Table [2 



Table [T] goes here. 



Let us now present the brief analysis of the above numerical results. Based on Examples 
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1 - 4, we can conclude that the proposed variants of accelerated value iteration algorithm 
PAVI 1-PAVI 3 and LAGS 1-LAGS 3 show good performance and converge up to 75 times 
faster than corresponding Bertsecas variants of value iteration algorithms. 

For almost all of the cases in Table 1, PAVI 3 is the best algorithm. PAVI 1 gives 
almost the same good results, while PAVI2 performs relatively poorly. On the other hand, 
for most cases in Table 2, LAGS 2 is the best algorithm, while both LAGS 1 and LAGS 3 
also show very strong performance. Summarizing these numerical results, we may conclude 
that both PAVI 3 and LAGS 3 show very good performance. Besides, application of these 
methods does not require choosing a stepsize. Similarly, both PAVI 1 and LAGS 1 show good 
performance, though additional computational efforts are necessarily to obtain a stepsize. 
We should also notice that the performance of both PAVI 2 and LAGS 2 may vary. This can 
be explained by the fact that it is not guaranteed that accelerating operators can be applied 
for all iterates. Because this fact, these algorithms are sensitive to the choice of a stepsize 
and to the structure of transition probability matrix. 

Let us also notice that the performance of two-phase algorithms depends on the choice 
of the discount factor of the corresponding discounted MDP problem which is solved during 
the first phase. We have the following tradeoff: having the discount closer to 1 leads to 
the increase in the number of iterations for the first phase. On the other hand, it make 
the bounds for the optimal average reward more tight, which leads to the reduction of the 
number of iterations of the second phase. The values of upper and lower bounds of the 
optimal average reward of Examples 1-4 are presented in Table |2] 
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Table [2] goes here. 



Based on Examples 1-4, we can conclude that the two-phase accelerated value iteration 
algorithms with second phase as the standard value iteration combined with PAVI 1, PAVI2, 
and PAVI 3 show better performance than the value iteration algorithm proposed by Bert- 
secas. For the best cases the accelerated value iteration algorithms converge up to 12 times 
faster than corresponding Bertsecas algorithm. We can also conclude that two-phase acceler- 
ated value iteration algorithms with second phase as the standard value iteration combined 
with PAVI 1 for most of the cases perform better than one combined with PAVI 1 and PAVI 
3. 

4 Computational Complexity and Savings 

We now evaluate the number of the additional arithmetic operation required for application 
of the proposed accelerated operators. For the standard VI, when the transition probability 
matrices are fully dense, each iteration will take C| S*! 2 (where C is the average number of 
actions per state) multiplications and divisions. With sparse transition probability matrices, 
this number can be estimated as NC\S\ (where N is the average number of nonzero entries 
per row of the transition probability matrices). 

The additional effort required in GAVI is due to the acceleration operator used in Step 
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1 of GAVI. However, as it was discussed in [8] the acceleration step requires only C\S\ 
multiplication and division. This means that the iteration of GAVI requires CISTS' + 1| 
multiplication and division, which is just slightly more than that of standard value iteration. 
Now we will show that for stochastic shortest path problems the following reduction of 
computational complexity is possible. With either Projective Operator or Linear Extension 
Operator, a trivial 1-dimensional LP should be solved per iteration of GAVI. Let us evaluate 
the complexity of this step. Substitute the expression h k {i) + a into system of inequalities 
defining the set H X k+i. We will have the following system of inequalities 

n-1 

h\l) +a- Y,PiM)( h \j) + «) < r(l, a) " 
3=1 



n-1 

fc+1 



h k {n) + a - ^2Pnj(a-)(h k (j) + a) < r(n, a) — A 
3=1 



which can be written in the following form 



n— 1 n— 1 

a(l - £>«(«)) < r(l, a) - \ k+1 - (h k (l) - ^M)h\j)) 

3=1 3=1 



n— 1 n— 1 

a („ _ ^ Pli (a)) < r(n,a) - A fc+1 - (/i fe (n) - ^p„,(a)^(j)) 

3=1 3=1 

It is easy to notice that if h k e iiffe then the expressions in righthand side of all inequalities 
are positive. It is also easy to notice that for state i and action a e A(i) such that pu{a) = 
we have lefthandside expressions equal to and these inequalities are satisfied for all value 
of a. So, we have to evaluate the ratios only for states and actions for which pu(a) ^ 0. 

23 



For many real applications the number of such states K can be significantly less than n. 
So, the acceleration step requires only CK multiplications and divisions. Therefore, each 
iteration of GAVI may be just slightly more expensive than the standard value iteration. In 
conclusion, the additional computation due to the accelerating operators is marginal. 

5 Conclusions 

Using the monotone behavior of the contraction mapping operator used in the value iter- 
ation algorithm within the feasible set of the linear programming problem equivalent to 
the discounted MDP and stochastic shortest path models, we propose a class of operators 
that can be used in combination with the standard contraction mapping and Gauss-Seidel 
methods to improve the computational efficiency Two acceleration operators, Projective 
Operator and Linear Extension Operator, are particularly proposed and combined into the 
three variants of value iteration algorithms. The numerical studies show that the savings 
due to the acceleration have been essential and the maximum savings is up to 80 time faster 
than the case without our accelerating operator. It is especially interesting to mention that 
the savings become significant when the proposed variants of the value iteration algorithms 
for average reward MDPs are used with bounds obtained from solution of the corresponding 
discounted MDP problem. 
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A Proofs 

Lemma A.l. 

(i) Ifh>g, then for any \ k of GAVI T X kh > T X k.g and T GSxk h > T GSxk g. 

(ii) Ifh>g, then for any X k of GAVI T X kh > T xk g and T GSxk h > T GSxk g. 
Proof of Lemma \A.l\ Let us first prove that g > h implies T X kg > T x uh. 



For part (ii) the inequality can be simply replaced with a strict inequality. Let / = T GS g 
and i = T GS h. Then /(l) = (T QSx J)(l) = (T xk g)(l) < (T xk h)(l) = ^(1). By induction, 
assuming that f(k) < £(k) for all k < i, we get for k = % 



Proof of Lemma \2. 21 Keeping in mind the additional property of optimal bias P*h = 0, 
where P* is a positive matrix, and putting aside the trivial case r(i, a) = for all i and a, 
it is easy to see, that the optimal bias h should have both positive and negative coordinates 
[5]. Letting a approach to 1 in ( JT7l) . we can make the term f(a) be arbitrarily small. From 







j<i 



□ 
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this we can conclude that there exist two coordinates % and j that v a (i) < X* < v a (j) for all 
a sufficiently close to 1. □ 

Proof of Lemma \2.1\ Proof of this lemma is identical to proof of lemma 12.31 and can be 
found in the original paper [8]. □ 

Proof of Lemma \2.3[ Let h G H X k and g = T X kh. By definition of set H X k, g = T X kh > h. 
By monotonicity shown in Lemma [A. 1[ T X kg > T X kh = g. Thus, g G H X k. □ 



Proof of Lemma 2J^_. This lemma is a trivial corollary of a statement of Remark ??. Let 



h G H X k. By definition it means that 

N-l 

h(i) - y~]pij{a)h{j) < r(i, a) - \ h , for all % = 1, N. 
From Remark 12.21 we have \ k > X h+1 , so h satisfies 

N-l 

h(i) - Pij( a ) h ti) < r (h a) - for all i = 1, N, 

3=1 

which means that h G int(H n+ i) . □ 

Proof of Lemma \2. 51 The proof of (ii) is a trivial application of monotonicity lemma [AJ] 
and Condition (B) of the operator Zf.. (i) follows from Remark 12.21 (iii) As it was shown 
in [H [2] the Markovian operator T X k is a contraction mapping with respect to a weighted 
max-norm. Then it is easy to see (similar to the Remark 12.21) that the sequence h k and a 
sequence of fixed point of A fc -SSP h X k both converge to h* with respect to this norm. From 
this and inequality (ii) we immediately obtain (iii). □ 
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Proof of Theorem \2.1[ Condition (A) is satisfied trivially since h + ae G H X k for any ft G H X k 
by the definition of Z k given in ( fT9l) . Now we have to show that Z k satisfies condition (B). 
We know a = is feasible to the linear program (1191) since h G H X k (or T A fcft > ft,). By 
lemma [2731 ft G H X k+i, and we have a* > 1. Therefore, Z^ft = ft + a*e > ft. □ 

Proof of Theorem \2.2\ For ft G H X k, Z k h = h + a*(T X kh — h), where a* is an optimal solution 
to the linear program in (1201) . Since ft + a*(T X kh — ft) is feasible to the linear program, we 
have T X k(h + a*(T X kh — ft)) > ft + a*(T X kh — ft). Together with lemma 12.31 this suffices 
Condition (A). By T X kh G H X k, a = 1 is feasible. Since T A kft > ft and a = 1 is feasible, 
a* > 0. Hence, Condition (B) is satisfied. □ 

Proof of Lemma \2.b\ The proof is similar to proofs of Lemma 12.31 □ 



Proof of Lemma \2. 1\ In order to prove inclusion H X k C Hcs xk , it is sufficient to show that 
if ft < T X kh, then ft < Tcs xk h. For ft G V^, let g = Tcs xk h and / = T X kh. Then, f(j) > h(j) 
for all j and g(l) = /(l). Assume that g(k) > f(k) for all k < i, then 

#(i) = min ( r(i, a) - \ k + Vp^a)^') + Vp^a)/^') ) 

> min ( r{i,a) - \ k + J2pij{a)f{j) + ) 

> min (r(i,a) - \ k + V^(a)ft(j) + V ftj (a)/i(j) ) = {T xk v)(i) = /(i). 

By induction, / < g, implying ft < T X kh = f <g = T GS ^ k h. □ 

Proof of Lemma \2. ffl For ft G Hcs xk , let g = Tcs xk h. By Lemma lA.ll ft < Tcs xk h = g. 
By replacing "min aej4 (j)" with inequalities, similar to the argument used in the proof of 
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Lemma [2 .7\ and by h < g, we have 

i-1 iV-l 

gf(i) < r(i, a) - A fc + y^j(a)g(j) + V] PiM)Ki) for a11 « e 2, iV for all a G A(z) 
i=i j=i 
i-l AT-1 

< r(i,a) — \ k + '^^p i j(a)g(j) + pij(a)g(j) for all i E 2, N for all a G A(z), 
which is equivalent to g < T X kg, or g G -f^A fc - D 
Proof of Theorem \2.3[ By Lemma 12.71 and Lemma 12.81 

Tcs xk H X k C T G s xk H GSxk C ff A fc. 

□ 
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Figure 1: A update 
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Figure 2: Projective AVI 
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f: The density of the transition probability matrix (%). 

%: These two columns are the number of iterations of the Bertsecas value iteration algorithm with improved 
bounds obtained from the phase 1 and without them. 

Table 1: The number of iterations of the value iteration algorithms with and without an 
accelerating operator applied to a family of MDPs from Examples [1] - HJ 



33 





Density^ 


rAba 1 




rAbb 3 


ricrtsccas Cjb 1 + 


Uerstsecas 2 + 


Example 1 


30 


69 


62 


63 


346 


462 


40 


99 


63 


71 


283 


408 


50 


81 


63 


51 


on 1 

221 


o o o 
666 


on 


CO 

OZ 


c i 
51 




o a n 
Z4U 


on 1 
6\Jl 


({) 


i no 
11) O 


b4 


f ( 


oon 
zJU 


O A A 

J44 


OA 

80 


105 


65 


H7 

97 


o c o 

253 


oo o 

283 


aa 
90 


102 


4«i 


54 


1 AO 

198 


O AO 

302 


Example 2 


60 


137 


43 


116 


383 


570 


71) 


1 0*7 

lz7 


1 oo 

132 


oo 


a no 
4Uz 


ol)5 


on 


1 oo 

Izo 


87 


lUo 


o o o 
6Q6 


ooy 


90 


i on 

130 


62 


1 no 

103 


o o o 

383 


568 


Example 3 


40 


151 


70 


86 


345 


487 


CCA 

50 


97 


o7 


*70 
f2 


oo o 
328 


ill 


60 


116 


137 


125 


369 


581 


70 


81 


41 


85 


324 


491 


80 


77 


38 


82 


302 


454 


90 


82 


84 


131 


314 


470 


Example 4 


50 


118 


46 


111 


552 


787 


60 


107 


140 


97 


545 


810 


70 


119 


67 


94 


584 


807 


80 


86 


114 


129 


572 


789 


90 


97 


98 


134 


585 


837 



f : The density of the transition probability matrix (%). 

p. These two columns are the number of iterations of the Gauss-Seidel variant of Bertsecas algorithm with 
improved bounds obtained from the phase 1 and without them. 

Table 2: The number of iterations of the value iteration algorithms with and without an 
accelerating operator applied to a family of MDPs from Examples [1] - HJ 
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f: The density of the transition probability matrix (%). 

p. Two values in these columns correspond to the upper and lower bounds of the optimal average reward A* 

Table 3: Comprising of the upper and lower bounds of the optimal average reward A* of 
family of MDPs from Examples [I] - H] obtained by solving corresponding discouned MDP 
problems with various values of the discount factor a. 
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